Automatic Web Page Categorization by Link and Context Analysis

Authors

  • Giuseppe Attardi
  • Antonio Gullì
  • Fabrizio Sebastiani
Abstract

Assistance in retrieving documents on the World Wide Web is provided either by search engines, through keyword-based queries, or by catalogues, which organize documents into hierarchical collections. Maintaining catalogues manually is becoming increasingly difficult due to the sheer amount of material on the Web, so it is becoming necessary to resort to techniques for the automatic classification of documents. Automatic classification is traditionally performed by extracting the information for representing a document (“indexing”) from the document itself. This paper describes the novel technique of categorization by context, which instead extracts useful information for classifying a document from the context where a URL referring to it appears. We present the results of experiments with Theseus, a classifier that exploits this technique.
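
The idea of categorization by context can be illustrated with a minimal Python sketch. It is not the Theseus implementation, which the abstract does not detail. It assumes that the "context" of a URL is the title of the referring page plus the anchor text of the link, and that categories are described by hand-picked keyword sets. All names here (`LinkContextParser`, `context_terms`, `categorize`) are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collect the page title and the (href, anchor text) pair of every link."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []          # list of (href, anchor_text)
        self._in_title = False
        self._href = None        # href of the <a> currently open, if any
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._anchor = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor)))
            self._href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._href is not None:
            self._anchor.append(data.strip())


def context_terms(pages, target_url):
    """Count words found in the context of links that point to target_url.

    Assumption: context = referring page title + anchor text.
    The target document itself is never read.
    """
    terms = Counter()
    for html in pages:
        parser = LinkContextParser()
        parser.feed(html)
        for href, anchor in parser.links:
            if href == target_url:
                text = f"{parser.title} {anchor}".lower()
                terms.update(re.findall(r"[a-z]+", text))
    return terms


def categorize(pages, target_url, categories):
    """Return the category whose keywords best match the context, or None."""
    terms = context_terms(pages, target_url)
    scores = {name: sum(terms[w] for w in words)
              for name, words in categories.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Given a set of referring pages, `categorize(pages, url, categories)` assigns `url` to a category using only the text around the links to it. A real classifier would need a richer notion of context and a less naive scoring scheme than keyword overlap.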


Similar resources

Topic Continuity for Web Document Categorization and Ranking

PageRank is primarily based on link structure analysis. Recently, it has been shown that content information can be utilized to improve link analysis. We propose a novel algorithm that harnesses the information contained in the history of a surfer to determine his topic of interest when he is on a given page. As the history is unavailable until query time, we guess it probabilistically so that ...


Web prefetching through automatic categorization

The present report provides a novel transparent and speculative algorithm for content-based web page prefetching. The proposed algorithm relies on a user profile that is dynamically generated while the user is browsing the Internet and is updated over time. The objective is to reduce the user-perceived latency by anticipating future actions. In doing so, the AdaBoost algorithm is used in order to...


Distributed Approach to Web Page Categorization Using Map-Reduce Programming Model

The web is a large repository of information and to facilitate the search and retrieval of pages from it, categorization of web documents is essential. An effective means to handle the complexity of information retrieval from the internet is through automatic classification of web pages. Although lots of automatic classification algorithms and systems have been presented, most of the existing a...


Automatic Evaluation of Interfaces on the Internet

Empirical methods in human-computer interaction (HCI) are very expensive, and the large number of information systems on the Internet requires great efforts for their evaluation. Automatic methods try to evaluate the quality of Web pages without human intervention in order to reduce the cost for evaluation. However, automatic evaluation of an interface cannot replace usability testing and other...


A Novel Web Page Categorization Algorithm Based on Block Propagation Using Query-Log Information

Most existing web page classification algorithms, including content-based, link-based, or query-log analysis methods, treat pages as the smallest units. However, web pages usually contain some noisy or biased information which could affect the performance of classification. In this paper, we propose a Block Propagation Categorization (BPC) algorithm which deeply mines web structure and views block...


Publication date: 1999